[https://nvbugs/6336747][fix] Fail fast when executor worker stalls #15561
2ez4bz wants to merge 2 commits into
Conversation
a093cc2 to a2b7c52
/bot run --stage-list "DGX_B200-PyTorch-*" --disable-reuse-test
PR_Github #55338 [ run ] triggered by Bot. Commit:
PR_Github #55338 [ run ] completed with state
a2b7c52 to af6cae1
/bot run --stage-list "DGX_B200-PyTorch-*" --disable-reuse-test
PR_Github #55353 [ run ] triggered by Bot. Commit:
PR_Github #55353 [ run ] completed with state
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55362 [ run ] triggered by Bot. Commit:
PR_Github #55362 [ run ] completed with state
af6cae1 to 620220c
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55412 [ run ] triggered by Bot. Commit:
620220c to 8ce44ce
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55412 [ run ] completed with state
PR_Github #55427 [ run ] triggered by Bot. Commit:
PR_Github #55427 [ run ] completed with state
8ce44ce to 2bf1ae1
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
2bf1ae1 to 1a46405
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55562 [ run ] triggered by Bot. Commit:
PR_Github #55563 [ run ] triggered by Bot. Commit:
PR_Github #55562 [ run ] completed with state
PR_Github #55563 [ run ] completed with state
1a46405 to bfe993b
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
bfe993b to 33ef1ec
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55580 [ run ] triggered by Bot. Commit:
PR_Github #55580 [ run ] completed with state
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55607 [ run ] triggered by Bot. Commit:
PR_Github #55607 [ run ] completed with state
33ef1ec to 183f4f7
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55697 [ run ] triggered by Bot. Commit:
PR_Github #55697 [ run ] completed with state
0b1b09d to 807178c
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55708 [ run ] triggered by Bot. Commit:
807178c to 740a40d
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
740a40d to 2ff36aa
PR_Github #55708 [ run ] completed with state
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast
PR_Github #55715 [ run ] triggered by Bot. Commit:
PR_Github #55715 [ run ] completed with state
@coderabbitai summary
Description
A stuck or disconnected executor worker left the proxy blocked
indefinitely: the request queue uses an unbounded send HWM with no
send timeout, so request_queue.put -> socket.send never returned once
the worker stopped draining, and the error monitor never tripped. In
CI this could surface as a ~1h hang ending in an opaque timeout kill.
The stall itself is non-deterministic and not yet root-caused.
Make the failure fast and legible instead:
- Check worker liveness, raising RequestError if the worker has not
  accepted the request within a timeout.
- Treat the worker as stalled and abort in-flight requests when no
  result arrives while requests are outstanding.
- Bound the per-request wait in the VideoMME evaluator.
- Dump all thread stacks so the next occurrence is diagnosable.
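The bounded submit with a liveness check can be sketched roughly as follows. This is only an illustration, not the code in this PR: a bounded `queue.Queue` stands in for the real request socket, and `put_with_liveness`, `is_worker_alive`, `timeout_s`, `poll_s`, and the local `RequestError` class are all hypothetical names chosen for the example.

```python
import queue
import time


class RequestError(RuntimeError):
    """Local stand-in for the RequestError named in the description."""


def put_with_liveness(request_queue, request, is_worker_alive,
                      timeout_s=5.0, poll_s=0.1):
    """Enqueue `request`, failing fast instead of blocking forever.

    `request_queue` is a bounded queue.Queue (an assumption of this
    sketch); `is_worker_alive` is any zero-argument callable.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # A dead worker will never drain the queue, so give up at once.
        if not is_worker_alive():
            raise RequestError("executor worker is not alive")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise RequestError(
                f"worker did not accept the request within {timeout_s}s")
        try:
            # Wait in short slices so liveness is re-checked regularly.
            request_queue.put(request, timeout=min(poll_s, remaining))
            return
        except queue.Full:
            continue
```

Waiting in short slices rather than one long blocking call is what lets the caller notice a dead worker before the full timeout expires.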
This mitigates the hang and captures worker state; it does not fix
the underlying intermittent stall.
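A minimal sketch of the stall detection and stack capture, under stated assumptions: `StallWatchdog`, its methods, and `dump_all_thread_stacks` are hypothetical names, one result per request is assumed, and the dump is taken in the calling process with the standard-library `faulthandler` module. How this PR actually triggers the dump in the worker is not shown here.

```python
import faulthandler
import sys
import threading
import time


class StallWatchdog:
    """Flags a stall when requests are outstanding but no result arrives."""

    def __init__(self, stall_timeout_s, on_stall):
        self._stall_timeout_s = stall_timeout_s
        self._on_stall = on_stall  # e.g. abort in-flight requests
        self._lock = threading.Lock()
        self._outstanding = 0
        self._last_progress = time.monotonic()

    def request_submitted(self):
        with self._lock:
            if self._outstanding == 0:
                # Start the clock when the worker first has work to do.
                self._last_progress = time.monotonic()
            self._outstanding += 1

    def result_received(self):
        with self._lock:
            self._outstanding -= 1
            self._last_progress = time.monotonic()

    def check(self):
        """Return True (and call on_stall) if the worker looks stalled."""
        with self._lock:
            idle = time.monotonic() - self._last_progress
            stalled = (self._outstanding > 0
                       and idle > self._stall_timeout_s)
        if stalled:
            self._on_stall()
        return stalled


def dump_all_thread_stacks(file=None):
    """Write the stack of every thread in this process to `file`."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback(file=target, all_threads=True)
```

An idle worker with nothing outstanding is never reported as stalled; only silence while requests are pending counts. `check()` would be called periodically, for example from the thread that waits on results.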
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.